Non-hierarchical Structures: How to Model and Index Overlaps?
Overlap is a common phenomenon seen when structural components of a digital
object are neither disjoint nor nested inside each other. Overlapping
components resist reduction to a structural hierarchy, and tree-based indexing
and query processing techniques cannot be used for them. Our solution to this
data modeling problem is TGSA (Tree-like Graph for Structural Annotations), a
novel extension of the XML data model for non-hierarchical structures. We
introduce an algorithm for constructing TGSA from annotated documents; the
algorithm can efficiently process non-hierarchical structures and is associated
with formal proofs, ensuring that transformation of the document to the data
model is valid. To enable high-performance query analysis in large data
repositories, we further introduce an extension of XML pre-post indexing for
non-hierarchical structures, which can process both reachability and
overlapping relationships. Comment: The paper has been accepted at the Balisage 2014 conference.
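To illustrate the two relationships named in the abstract above, the following is a minimal sketch of classic pre/post-order indexing on an ordinary tree, plus a span-based overlap test. It is not the TGSA extension itself; the `Node` class, the function names, and the character-offset span representation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical annotation node: a label plus its child annotations.
    name: str
    children: list = field(default_factory=list)


def pre_post_index(root):
    """Assign (pre, post) ranks to every node in one depth-first traversal."""
    index = {}
    pre_counter = 0
    post_counter = 0

    def visit(node):
        nonlocal pre_counter, post_counter
        pre = pre_counter
        pre_counter += 1
        for child in node.children:
            visit(child)
        index[node.name] = (pre, post_counter)
        post_counter += 1

    visit(root)
    return index


def is_ancestor(index, a, b):
    """Reachability in a tree: a precedes b in preorder and follows it in postorder."""
    pre_a, post_a = index[a]
    pre_b, post_b = index[b]
    return pre_a < pre_b and post_a > post_b


def spans_overlap(span_a, span_b):
    """Proper overlap of half-open spans: they intersect but neither contains the other."""
    (s1, e1), (s2, e2) = span_a, span_b
    intersect = s1 < e2 and s2 < e1
    nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    return intersect and not nested


# Toy document: doc -> [page1 -> [line1], page2]
doc = Node("doc", [Node("page1", [Node("line1")]), Node("page2")])
index = pre_post_index(doc)
```

In this sketch, ancestor checks reduce to two integer comparisons, which is what makes pre/post indexing attractive for trees. Overlap has to be tested separately on the spans, since a plain tree index cannot represent two components that intersect without nesting.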
Personal Entity, Concept, and Named Entity Linking in Conversations
Building conversational agents that can have natural and knowledge-grounded
interactions with humans requires understanding user utterances. Entity Linking
(EL) is an effective and widely used method for understanding natural language
text and connecting it to external knowledge. It has been shown, however, that
existing EL methods developed for annotating documents are suboptimal for
conversations, where personal entities (e.g., "my cars") and concepts are
essential for understanding user utterances. In this paper, we introduce a
collection and a tool for entity linking in conversations. We collect EL
annotations for 1327 conversational utterances, consisting of links to named
entities, concepts, and personal entities. The dataset is used for training our
toolkit for conversational entity linking, CREL. Unlike existing EL methods,
CREL is developed to identify both named entities and concepts. It also
utilizes coreference resolution techniques to identify personal entities and
references to the explicit entity mentions in the conversations. We compare
CREL with state-of-the-art techniques and show that it outperforms all existing
baselines.
Data Augmentation for Conversational AI
Advancements in conversational systems have revolutionized information
access, surpassing the limitations of single queries. However, developing
dialogue systems requires a large amount of training data, which is a challenge
in low-resource domains and languages. Traditional data collection methods like
crowd-sourcing are labor-intensive and time-consuming, making them ineffective
in this context. Data augmentation (DA) is an effective approach to alleviate
the data scarcity problem in conversational systems. This tutorial provides a
comprehensive and up-to-date overview of DA approaches in the context of
conversational systems. It highlights recent advances in conversation
augmentation, open domain and task-oriented conversation generation, and
different paradigms of evaluating these models. We also discuss current
challenges and future directions to help researchers and practitioners
further advance this area.
Conversational Entity Linking: Problem Definition and Datasets
Machine understanding of user utterances in conversational systems is of utmost importance for enabling engaging and meaningful conversations with users. Entity Linking (EL) is one of the means of text understanding, with proven efficacy for various downstream tasks in information retrieval. In this paper, we study entity linking for conversational systems. To develop a better understanding of what EL in a conversational setting entails, we analyze a large number of dialogues from existing conversational datasets and annotate references to concepts, named entities, and personal entities using crowdsourcing. Based on the annotated dialogues, we identify the main characteristics of conversational entity linking. Further, we report on the performance of traditional EL systems on our Conversational Entity Linking dataset, ConEL, and present an extension to these methods to better fit the conversational setting. The resources released with this paper include annotated datasets, detailed descriptions of crowdsourcing setups, as well as the annotations produced by various EL systems. These new resources allow for an investigation of how the role of entities in conversations is different from that in documents or isolated short text utterances like queries and tweets, and complement existing conversational datasets.
Find the Funding: Entity Linking with Incomplete Funding Knowledge Bases
Automatic extraction of funding information from academic articles adds
significant value to industry and research communities, such as tracking
research outcomes by funding organizations, profiling researchers and
universities based on the received funding, and supporting open access
policies. Two major challenges of identifying and linking funding entities are:
(i) the sparse graph structure of the Knowledge Base (KB), which makes the commonly
used graph-based entity linking approaches suboptimal for the funding domain, and
(ii) missing entities in the KB, which (unlike recent zero-shot approaches)
require marking entity mentions without KB entries as NIL. We propose an
entity linking model that can perform NIL prediction and overcome data scarcity
issues in a time- and data-efficient manner. Our model builds on
transformer-based mention detection and a bi-encoder model to perform entity
linking. We show that our model outperforms strong existing baselines. Comment: Accepted at COLING 202
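As a rough illustration of bi-encoder linking with NIL prediction as described in the abstract above, the sketch below scores a mention vector against precomputed entity vectors and returns NIL when no candidate is similar enough. The similarity threshold, the function names, and the toy two-dimensional vectors (standing in for encoder outputs) are assumptions for illustration, not the paper's actual model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_mention(mention_vec, entity_vecs, nil_threshold=0.5):
    """Return the best-scoring entity id, or "NIL" if no score exceeds the threshold.

    mention_vec and entity_vecs stand in for the outputs of two separate
    encoders (hypothetical here); a bi-encoder compares them directly.
    """
    best_id, best_score = "NIL", nil_threshold
    for entity_id, vec in entity_vecs.items():
        score = cosine(mention_vec, vec)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id


# Toy knowledge base with two hypothetical funder entries.
entity_vecs = {"funder_a": [1.0, 0.0], "funder_b": [0.0, 1.0]}
linked = link_mention([0.9, 0.1], entity_vecs)
unlinked = link_mention([-1.0, -1.0], entity_vecs)
```

Because entity vectors can be computed once and stored, a design like this only needs to encode the mention at query time. A fixed threshold is the simplest possible NIL rule and would normally be tuned on held-out data.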
MMEAD: MS MARCO Entity Annotations and Disambiguations
MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for
entity links for the MS MARCO datasets. We specify a format to store and share
links for both document and passage collections of MS MARCO. Following this
specification, we release entity links to Wikipedia for documents and passages
in both MS MARCO collections (v1 and v2). Entity links have been produced by
the REL and BLINK systems. MMEAD is an easy-to-install Python package, allowing
users to load the link data and entity embeddings effortlessly. Using MMEAD
takes only a few lines of code. Finally, we show how MMEAD can be used for IR
research that uses entity information. We show how to improve recall@1000 and
MRR@10 on more complex queries on the MS MARCO v1 passage dataset by using this
resource. We also demonstrate how entity expansions can be used for interactive
search applications.
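For reference, the two measures mentioned in the abstract above, recall@1000 and MRR@10, can be computed as in the following minimal sketch. It does not use the MMEAD package; the function names and the toy rankings are assumptions for illustration.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1/rank of the first relevant item within the top k, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Arithmetic mean, used to average a per-query measure over all queries."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Two toy queries: (system ranking, set of relevant passage ids).
runs = [
    (["d3", "d1", "d7"], {"d1", "d9"}),
    (["d2"], {"d5"}),
]
mrr_at_10 = mean(reciprocal_rank_at_k(r, rel, k=10) for r, rel in runs)
recall_at_1000 = mean(recall_at_k(r, rel, k=1000) for r, rel in runs)
```

MRR@10 rewards placing a relevant passage near the top, while recall@1000 measures how many relevant passages a first-stage ranker retrieves at all, so the two measures capture different effects of adding entity information.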
Semantic Search with Knowledge Bases
Over the past decade, modern search engines have made significant progress towards better understanding searchers' intents and providing them with more focused answers, a paradigm that is called "semantic search." Semantic search is a broad area that encompasses a variety of tasks and has a core enabling data component, called the knowledge base. In this thesis, we utilize knowledge bases to address three tasks involved in semantic search: (i) query understanding, (ii) entity retrieval, and (iii) entity summarization.
Query understanding is the first step in virtually every semantic search system. We study the problem of identifying entity mentions in queries and linking them to the corresponding entries in a knowledge base. We formulate this as the task of entity linking in queries, propose refinements to evaluation measures, and publish a test collection for training and evaluation purposes. We further establish a baseline method for this task through a reproducibility study, and introduce different methods with the aim of striking a balance between efficiency and effectiveness.
Next, we turn to using the obtained annotations for answering the queries. Here, our focus is on the entity retrieval task: answering search queries by returning a ranked list of entities. We introduce a general feature-based model based on Markov Random Fields, and show improvements over existing baseline methods. We find that the largest gains are achieved for complex natural language queries.
Having generated an answer to the query (from the entity retrieval step), we move on to presentation aspects of the results. We introduce and address the novel problem of dynamic entity summarization for entity cards, by breaking it into two subtasks, fact ranking and summary generation. We perform an extensive evaluation of our method using crowdsourcing, and show that our supervised fact ranking method brings substantial improvements over the most comparable baselines.
In this thesis, we take the reproducibility of our research very seriously. Therefore, all resources developed within the course of this work are made publicly available. We further make two major software and resource contributions: (i) the Nordlys toolkit, which implements a range of methods for semantic search, and (ii) the extended DBpedia-Entity test collection.
Gridvoronoi: An Efficient Spatial Index for Nearest Neighbor Query Processing